-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connect: rewrite envoy bootstrap on every restart #19787
Conversation
Fixes #19781 Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781). This fixes it by *always* rewriting the envoy bootstrap file every time the Nomad agent starts. This does mean we may write a new bootstrap file to an already running Envoy task, but in my testing that doesn't have any impact. *Alternative 1: Use a regular file* An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is *unsafe* as when Consul ACLs are enabled the file will contain a secret token: https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap *Alternative 2: Detect if file is already written* An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't. This is just a more complicated form of the current fix. I think in general in the absence of other factors task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and our ability to reason about task restarts vs agent restarts vs node reboots by making them all take the same code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM code-wise. Interestingly, this fix looks similar to what we did in the envoy_version
hook: #16815. When we get some time it might be helpful to see where else we have this kind of logic in the hooks that has forgotten about host restarts.
The TestTaskRunner_EnvoyBootstrapHook_gateway_ok
test is failing, which seems relevant to this change. (There's also a flaky test in the leadership transfer command tests which we know about in #19718).
It's no longer accurate and isn't really related to this test.
// | ||
// Done is useful for expensive operations such as downloading artifacts, or | ||
// for operations which might fail needlessly if rerun while a node is | ||
// disconnected. | ||
Done bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great comment, this actually clarifies a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once that last test is passing
Fixes #19781
Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781).
This fixes it by always rewriting the envoy bootstrap file every time the Nomad agent starts. This does mean we may write a new bootstrap file to an already running Envoy task, but in my testing that doesn't have any impact.
Alternative 1: Use a regular file
An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is unsafe as when Consul ACLs are enabled the file will contain a secret token:
https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap
Alternative 2: Detect if file is already written
An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't.
This is just a more complicated form of the current fix. I think in general in the absence of other factors task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and our ability to reason about task restarts vs agent restarts vs node reboots by making them all take the same code path.